Goto

Collaborating Authors

 bregman divergence


Latent Process Generator Matching

arXiv.org Machine Learning

A related situation arises when an auxiliary process is introduced to aid training but modelling its dynamics at generation time is unnecessary or difficult, as in Billera et al. [2025b] and Kim et al. [2025]. In each of these works, the projection result and its associated loss are derived on a case-by-case basis, and all theorems are restricted to marginalization over a discrete component of the extended state space. We introduce a general framework that removes these restrictions: given a time-inhomogeneous Feller process (Yt)0 t 1 on an arbitrary state space Y and a map Φ: Y X, one may learn a linear parametrisation of the generator of a Feller process on X whose one-time marginals coincide with those of (Φ(Yt))0 t 1. For Y = X Z and Φthe projection onto the first coordinate, this subsumes these prior works as special cases, allowing for a general class of latent processes (Zt)0 t 1 in a nearly arbitrary state space Z, using the formalism of generator matching to allow for continuous, discrete, or manifold-valued processes. In particular, the learnt process at t = 1 samples from the distribution of Φ(Y1), which is the desired data distribution. We give sufficient conditions for a loss function to be valid in this general setting, recovering the results of the works cited above as corollaries. This result has broad applicability, enabling the construction of a wide array of new flow matching schemes by allowing for a more general class of latent spaces. As a concrete new application, we outline a non-projection Φ: Y X with manifold-valued latents for protein structure generation that separates chain-level rigid-body motion from internal flexibility ( 4), where the particular chain-level versus residue-level or internal state is latent, and the model only sees the world state, which we plan to implement in future work. 2 EARLIERWORK Several recent generative models train with the aid of a latent stochastic process that is marginalised out at generation time.


Calibeating for general proper losses: A Bregman divergence approach

arXiv.org Machine Learning

This work introduces a general framework for calibeating based on regret minimization. As compared to Foster and Hart's seminal calibeating work which had specialized treatments of Brier score (squared loss) and log loss, we consider a large family of proper losses that includes $α$-Tsallis losses (for $α\in [1, 2]$) and Lipschitz losses. Our results for Tsallis losses also hold for an unscaled version of Tsallis loss that recovers log loss. Our analysis is oriented around the Bregman divergence view of a proper loss. Technically, our results for the family of Tsallis losses that we consider are U-calibration results, simultaneously obtaining logarithmic regret for all losses in this family while having a weaker dependence on the dimension compared to previous results. Of potential independent interest, we also show a new regret equality for the regret of Be The Regularized Leader. This regret equality holds for general proper losses and itself is based on two results related to online updating formulas for the generalized variance, the latter being a previously introduced generalization of variance based on Bregman divergences.


Generalized Distributional Alignment Games for Unbiased Answer-Level Fine-Tuning

arXiv.org Machine Learning

The Distributional Alignment Game framework provides a powerful variational perspective on Answer-Level Fine-Tuning (ALFT). However, standard algorithms for these games rely on estimating logarithmic rewards from small batches, introducing a systematic bias due to Jensen's inequality that can destabilize training. In this paper, we systematically resolve this structural estimation bias. First, we generalize the alignment game to arbitrary Bregman divergences, showing that for a family of geometries inducing polynomial rewards, we can construct provably exact and unbiased estimators using U-statistics. Second, for the canonical KL divergence game where an exact solution is impossible, we derive a globally robust minimax polynomial estimator that is provably optimal, achieving the fundamental statistical error limit of $Θ(1/K^2)$, which we establish via the Ditzian-Totik theorem. Finally, we synthesize these two approaches to propose a novel Variance-Optimal Augmented Polynomial Optimization Program (AQP) Estimator, proving that by systematically reducing variance, our method achieves not only optimal bias but also provably accelerated game convergence, leading to more efficient and stable training with zero online computational overhead.



Clustering with Bregman Divergences: an Asymptotic Analysis

Neural Information Processing Systems

Clustering, in particular k-means clustering, is a central topic in data analysis. Clustering with Bregman divergences is a recently proposed generalization of k-means clustering which has already been widely used in applications. In this paper we analyze theoretical properties of Bregman clustering when the number of the clusters k is large. We establish quantization rates and describe the limiting distribution of the centers as k, extending well-known results for k-means clustering.


Learning Bregman Divergences with Application to Robustness

Neural Information Processing Systems

We propose a novel and general method to learn Bregman divergences from raw high-dimensional data that measure similarity between images in pixel space. As a prototypical application, we learn divergences that consider real-world corruptions of images (e.g., blur) as close to the original and noisy perturbations as far, even if in $L^p$-distance the opposite holds. We also show that the learned Bregman divergence excels on datasets of human perceptual similarity judgment, suggesting its utility in a range of applications. We then define adversarial attacks by replacing the projected gradient descent (PGD) with the mirror descent associated with the learned Bregman divergence, and use them to improve the state-of-the-art in robustness through adversarial training for common image corruptions. In particular, for the contrast corruption that was found problematic in prior work we achieve an accuracy that exceeds the $L^p$- and the LPIPS-based adversarially trained neural networks by a margin of 27.16\% on the CIFAR-10-C corruption data set.


A scaled Bregman theorem with applications

Neural Information Processing Systems

Bregman divergences play a central role in the design and analysis of a range of machine learning algorithms through a handful of popular theorems. We present a new theorem which shows that ``Bregman distortions'' (employing a potentially non-convex generator) may be exactly re-written as a scaled Bregman divergence computed over transformed data. This property can be viewed from the standpoints of geometry (a scaled isometry with adaptive metrics) or convex optimization (relating generalized perspective transforms). Admissible distortions include {geodesic distances} on curved manifolds and projections or gauge-normalisation. Our theorem allows one to leverage to the wealth and convenience of Bregman divergences when analysing algorithms relying on the aforementioned Bregman distortions. We illustrate this with three novel applications of our theorem: a reduction from multi-class density ratio to class-probability estimation, a new adaptive projection free yet norm-enforcing dual norm mirror descent algorithm, and a reduction from clustering on flat manifolds to clustering on curved manifolds. Experiments on each of these domains validate the analyses and suggest that the scaled Bregman theorem might be a worthy addition to the popular handful of Bregman divergence properties that have been pervasive in machine learning.


Representation Learning of Compositional Data

Neural Information Processing Systems

We consider the problem of learning a low dimensional representation for compositional data. Compositional data consists of a collection of nonnegative data that sum to a constant value. Since the parts of the collection are statistically dependent, many standard tools cannot be directly applied. Instead, compositional data must be first transformed before analysis. Focusing on principal component analysis (PCA), we propose an approach that allows low dimensional representation learning directly from the original data.